Prosper Marketplace is America’s first peer-to-peer lending marketplace, with over $7 billion in funded loans. Borrowers request personal loans on Prosper and investors (individual or institutional) can fund anywhere from $2,000 to $35,000 per loan request. In addition to credit scores, ratings, and histories, investors can consider borrowers’ personal loan descriptions, endorsements from friends, and community affiliations. Prosper handles the servicing of the loan and collects and distributes borrower payments and interest back to the loan investors.[https://en.wikipedia.org/wiki/Prosper_Marketplace].Here I used the data available to the public from the institution and analyse the borrower market (including the demographic segmentation and beyond) and try to let data tell some ‘behind scene’ stories about the borrowers.
The top borrowers are customers that have income ranges from $25,000-$49,999 followed by those with income range of $50,000 - $74,999 and there are fewer zero income borrowers as well.
## [1] "The APR summary statistics is :"
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00653 0.15630 0.20980 0.21880 0.28380 0.51230 25
The loan APR ranges from less than 5 % up to about 51 % though there are only very few borrowers with such high and lower APR values. Most borrowers seems to have a APR from 10 % upto 30 % with somehow a substantial amount of borrowers found to have APR value of ~ 36 %. The median APR value is ~21 %.
## [1] "The interest rate summary statistics is :"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
The borrower rate has roughly the same distribution as the borrower APR. However, the maximum rate is less than 50 %.
Most borrowers are employed and also there are few unemployed borrowers. It also shows full time employees have either higher chance of getting prosper loan or most of the loan applicants are under this category. On the other hand, there are few unemployed and Part time borrowers.This could be, due to their lower chance of getting approved to the loan because of their income or or there are fewer number of loan applicants under these categories.
## LoanStatus Percent
## 1 Cancelled 0.004477679
## 2 Chargedoff 10.739264765
## 3 Completed 34.096628308
## 4 Current 50.665830833
## 5 Defaulted 4.493798415
Even though most of the loan is active and completed, about 10.739265 % of the total loan is Chargedoff and 4.493798 % is defaulted.
I categorized Chargedoff and Defaulted loans as non performing loans for simplicity of analysis.
From the distributions of credit score we can see that, there are non performing borrowers across all credit scores from high to low. However, there are many borrowers with lower credit scores (< ~ 600) fall under the credit score distribution of non performing borrowers and relatively fewer borrowers with lower credit scores (< ~ 600) fall under the performing distribution.
## NULL
## IncomeRange PerformingCount NonPerformingCount
## 1 $0 381 236
## 2 $1-24,999 5444 1663
## 3 $25,000-49,999 26003 5452
## 4 $50,000-74,999 26897 3507
## 5 $75,000-99,999 15053 1528
## 6 $100,000+ 15690 1290
## NonPerformingPercentage
## 1 38.25
## 2 23.40
## 3 17.33
## 4 11.53
## 5 9.22
## 6 7.60
From this chart we can see an interesting trend of increasing performance with increasing income range. This could be explained by, borrowers with good incomes are more likely to pay their loans off. It is interesting to find that borrowers with no income are the most non performing borrowers (38.25 %) followed by borrowers with income ranges from $1-$24,999 (23.4 %) compared to all borrowers with registered income ranges.
It is clear that Most of the borrowers are under the 3 year loan period and relatively small number of borrowers borrowed for the 12 month period.
The Prosper rating distribution is almost normal with C the most freqent rating and AA the least frequent rating.
There is abig dip in listings from 2008 Q4 into 2009. This time period coincides with the collapse of Lehman brothers and the fallout of the global financial system.
I categorized the states as high loan and low loan states by number of borrowers (number of borrowers in thestate >1000 and number of borrowers in the state < 200). The above two distributions show the total loan amount and number of borrowers of the high loan states. As we can see from the histograms, CA,TX,NY and FL are the top states by total loan amount. This is consistent with the number of borrowers and I found it is related to the population size of the states. In order to check whether this high total loan amount is related to the population of the states I normalized the total loan amount by the total number of borrowers as the average loan amount per person as shown below.
From the above histogram, we can see that the average loan per person distribution is almost kind of flat and interestingly the normalized loan amount of CA, TX, NY, and FL is less than the the normalized loan amount of some states like NJ and MA. Therefore the higher total loan amount in larger states is related to the size of their populations.
It is interesting that 5 states (IA, ME,ND, SD, WY) have the least number of borrowers.
The loan amount shows generally an increasing pattern especially since the fourth quarter of 2009.
The estimated return increses with increasing estimiated effective yield and decreases with increasing estimated loss.
The monthly payment increases with the amount borrowed. This is expected since borrowers with higher loans need to pay more to pay off their debt in the specified loan period.
The estimated loss is low for borrowers with good prosper rating and high for those with bad prosper rating.
The plots show that, borrowers with good prosper ratings have higher credit scores and those with bad ratings have realtively lower credit scores.
Similarly, borrowers with good prosper rating have lower APRs and those with bad ratings have high APRs.
## [1] "The summary statistics of borrower APR is "
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00653 0.15630 0.20980 0.21880 0.28380 0.51230 25
## [1] "The summary statistics of borrower rate is "
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
The borrower APR and rate are linearly related. The average APR and rate are ~ 22% and 19 % respectively. The minimum 0 % interest rate is
strange and needs to be investigated further.
Over all the Borrower APR and interest rate have inverse relation with credit score.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
It is clear that borrowers who owe larger loan amounts borrow for longer durations. This is expected since they need longer time to pay off their loans or otherwise need to pay higher monthly paymemnts with shorter loan terms. The stars in the boxes show the average loan amount. From the summary statistics we can see the average loan amount is $8337 and the loan amount ranges from $1000 to $35000.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 131.6 217.7 272.5 371.6 2252.0
The monthly payment increases with loan amount. For the same loan amount the monthly payment amount is higher for shorter terms. This is expected since borrowers need to pay higher monthly payments in order to payoff the balance with in shorter terms. From the summary statistics we can see that the average monthly payment is ~ $272, and the maximum monthly payment is $2252. The minimum monthly payment which is 0.0 is clearly an outlier associated with cancelled loans.
## Loan_range Number_of_borrowers
## 1 (0,5e+03] 51824
## 2 (5e+03,1e+04] 31136
## 3 (1e+04,1.5e+04] 20033
## 4 (1.5e+04,2e+04] 5659
## 5 (2e+04,2.5e+04] 4605
## 6 (2.5e+04,3e+04] 177
## 7 (3e+04,3.5e+04] 503
It is clear to see that most borrowers have loans in the range from 1-5000 for aperiod of 36 months. Generallly most borrowers borrowed for 36 and 60 months and there are no borrowers borrowed more than 15000 for a period of 12 months.
It is interesting to see that most prosper loan amounts are in the range of 0,5000 followed by 5000 - 10,000 and most borrowers are employed.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.005 0.042 0.072 0.080 0.112 0.366 29084
The estimated loss increases with decreasing prosper rating. For same proper rating of (E,D,C,B), the median estimated loss is a little bit higher with shorter loan periods and vice versa for A and AA ratings. The average estimated loss is 8 % and the maximum estimated loss is 36.6 %.
The borrower APR decreases with increasing prosper rating grade. At the same rating, for boorrowers with lower rating (ratings E,D,C and B), the median borrower APR decreases with loan period and increases with loan period for borrowers with rating A and AA.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.183 0.116 0.162 0.169 0.224 0.320 29084
The estimated effective yield obtained from borrowers with lower prosper ratings is generally higher. This might be due to the fact that there are fewer customers with good ratings and also the APR and rate is higher for the borrowers with lower prosper ratings as we have already seen.
Insterestingly, for borrowers with the same rating, the estimated yield increases with increasing loan period. This is also explained by the fact that the more borrowers stayed in the loan the more interest they would pay and hence increase the estimated effective yield. The all over average estimated effective yield is ~ 17 % which is close to the difference between average estimated return and average estimated loss.
The loan amount borrowed until the third quarter of 2010 was less than $10000 and was only for a period of 36 months. It is interesting to see that both higher loan amount of more than 10000 and flexible payment plans were introduced after the fourth quarter of 2010.
The average APR of home owners is less than the others over the whole year. This might be associated with the credit history of the borrowers.
The average credit score of borrowers with home owners is above 650 over the whole year.
It is interesting to see that the average estimated loss is higher for borrowers with higher interest rates.
From the above correlation table we can see, there is a strong negative relatioship between prosper rating and estimated loss.
##
## Call:
## lm(formula = y ~ ., data = ddf2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.142468 -0.010746 -0.003912 0.006014 0.142047
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.716e-02 1.719e-03 44.895 <2e-16 ***
## EstimatedEffectiveYield 3.722e-01 2.024e-03 183.914 <2e-16 ***
## EstimatedLoss -2.612e-01 5.948e-03 -43.908 <2e-16 ***
## ProsperRating..numeric. -6.171e-03 1.938e-04 -31.836 <2e-16 ***
## EmploymentStatusDuration -6.490e-06 7.429e-07 -8.736 <2e-16 ***
## CreditScoreRangeLower 4.035e-06 1.843e-06 2.189 0.0286 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0177 on 59994 degrees of freedom
## Multiple R-squared: 0.6604, Adjusted R-squared: 0.6604
## F-statistic: 2.334e+04 on 5 and 59994 DF, p-value: < 2.2e-16
I used multivariat linear regression to see how the estimated return related to
other variables describing the borrowers. The detail results of the model is described in the summary. Though the model is not well tunned, it explains up to 69% of the variation in the estimated returns.
The loan APR ranges from less than 5 % up to about 51 %. However, there are only very few borrowers with such high and lower APR values. Most borrowers seems to have a APR from 10 % to 30 %. The median APR value is ~21 %.
When I investigated the demography of borrowers by state, I found a higher value of total loan amount and larger number of borrowers especially for the states CA, TX, NY and FL. This led me to ask is there any special reason for this? are people from these states tend to borrow more likely compared to people from other states? In order to answer this question and compare borrowers by state, I normalized the total loan amount
by the number of borrowers to see the average loan amount per person. As can be seen above the distribution of the normalized loan amount is almost nearly uniform and the states even found to have normalized loan amount less than some other states. This shows that the higher number of borrowers in these states might be due to higher number population in the states so that the number of applicants is also higher.
The loan amount borrowed until the third quarter of 2010 was less than 10000 and was only for a period of 36 months. It is interesting to see that both higher loan amount of more than 10000 and flexible payment plans were introduced after the fourth quarter of 2010.
The Prosper loan data set has 113,937 rows and 81 features. I explored some of the 81 variables and tried to find interesting relationships among the
variables. My focus of investigation was mostly on the profile of the borrowers such as income range, employment, credit score etc.. and the features of the loan such as APR, estimated loss, loan amount with time etc..,. I used ggplot to check how these variable distributed and related. On the process of exploration, I found many missing values in the data and I handeled the problem using R data cleaning tools such as from the deplyer package. Another challenge I encountered was the datset contains a lot of outliers and noise that makes hardly easy to extract the information I want using scatter or bar plots directly from the variables of interest. I handeled this problem by using transformations such as scaling the variables from linear to logarithimic, by averaging and choosing more approprioate ploting styles for example using box plot instead of histograms and scatter plots. In this way I was able find informative trends that explains the relationships between the variables of interest.
From the exploration of the data, I found the APR varies from less than 1 % upto a little higher than 50 %. The APR is related to the credit score of the borrowers. It is foud that borrowers with high credit scores have good prosper rating and lower APR. It is also found that most borrowers have loans in the range from 1-5000 for aperiod of 36 months and the loan amount grown over time since the second quarter of 2009. I estimated the percentage of loan status and found that about 10.74 % of the loan is Chargedoff and 4.5 % is defaulted.
Finally,I used multivariat linear regression to predict the estimated return. and the model explains up to 69% of the variation in the estimated returns. Even though this rough estimation gives a clue about the possibility of using machine learning to investigate the data, I recommend it to be done more carefully with proper feature selection, careful algorithm choices and proper parametric optimization. Since the data has many features, using combination of principal component analysis (PCA) to reduce the dimensionality and univariate feature selection could be helpful to develop a well tuned model of the data.